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Abstract 

^- A direct implementation of the bilateral filter ||T1 requires 0{a'^) 

C) operations per pixel, where CTs is the (effective) width of the spatial 

jy^ kernel. A fast implementation of the bilateral filter was recently pro- 

O posed in 1 19| that required 0(1) operations per pixel with respect to 

'""' (Ts . This was done by using trigonometric functions for the range ker- 

P^ nel of the bilateral filter, and by exploiting their so-called shiftability 

J> property. In particular, a fast implementation of the Gaussian bilateral 

OO filter was realized by approximating the Gaussian range kernel using 

^^ raised cosines. Later, it was demonstrated in |24| that this idea could 

. _^ be extended to a larger class of filters, including the popular non-local 

means filter [2, 3J. As already observed in [19J, a flip side of this ap- 
^!J proach was that the run time depended on the width cir of the range 

Z^ kernel. For an image with dynamic range [0, T], the run time scaled as 

^_^ 0(r^/cr^) with ar- This made it difficult to implement narrow range 

IL" kernels, particularly for images with large dynamic range. In this 

. ^H paper, we discuss this problem, and propose some simple steps to ac- 

^\. celerate the implementation, in general, and for small a,, in particular. 

We provide some experimental results to demonstrate the acceleration 
that is achieved using these modifications. 

Keywords: Bilateral filter, non-local means, shiftability, constant-time 
algorithm, Gaussian kernel, truncation, running maximum, max filter, re- 
cursive filter, 0(1) complexity. 
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1 Introduction 

The bilateral filter is an edge-preserving diffusion filter, which was intro- 
duced by Tomasi et al. in [IJ. The edge-preserving property comes from 
the use of a range kernel (along with the spatial kernel) that is used to 
control the diffusion in the vicinity of edges. In this work, we will focus on 
the Gaussian bilateral filter where both the spatial and range kernels are 
Gaussian llj. This is given by 

f{x) = - / gaA^- y) 9ar{f{x -y)- f{x)) f{x - y) dy (1) 



where 



/ 5a, (a; - y) gaAfix - y) - f{x)) dy. 
Jn 



Here, 5^3 (a^) is the centered Gaussian distribution on the plane with variance 
a^, and 5^^ (s) is the one-dimensional Gaussian distribution with variance 
a^; Q, is the support of ga^ (x) over which the averaging takes place. We call 
^CTs (x) and (7crr («) the spatial and the range kernel. 

The range kernel is controlled by the local distribution of intensity. Sharp 
discontinuities (jumps) in intensity typically occur in the vicinity of edges. 
This is picked up by the range kernel, which is then used to inhibit the 
spatial diffusion. On the other hand, the range kernel becomes inoperative 
in regions with smooth variations in intensity. The spatial kernel then takes 
over, and the bilateral filter behaves as a standard diffusion filter. Together, 
the spatial and range kernels perform smoothing in homogeneous regions, 
while preserving edges at the same time IITl . 

The bilateral filter has found widespread use in several image processing, 
computer graphics, and computer vision applications ||4ji5l|6l|8H9l; see lfTO| 
for further applications. More recently, the bilateral filter was extended 
by Baudes et al. ||2l in the form of the non-local means filter, where the 
similarity between pixels is measured using patches centered around the 
pixel. 

1.1 Fast bilater filter 

The direct implementation of ([TJ is computationally intensive, especially 
when as is large (cJr has no effect on the run time in this case). In particular, 
the direct implementation requires 0{a'^) operations per pixel. This makes 
the filter slow for real-time applications. Several efficient algorithms have 
been proposed in the past for implementing the filter in real time, e.g., see 



||IU[T2l[T3[T3|26ll2ZI- In im, Porikli demonstrated for the first time that the 
bilateral filter could be implemented using 0(1) operations per pixel (with 
respect to as)- This was done for two different settings: (a) Spatial box filter 
and arbitrary range filter, and (6) Arbitrary spatial filter and polynomial 
range filter. The author extended (6) to the Gaussian bilateral filter in ([l} 
by approximating go-,, (s) with its Taylor polynomial. The run time of this 
approximation was linear in the order of the polynomial. The problem with 
Taylor polynomials, how^ever, is that they provide good approximations of 
9ar (s) only locally around the origin. In particular, they have the following 
drawbacks: 

• Taylor polynomials are not guaranteed to be positive and monotonic 
away from the origin, where the approximation is poor. Moreover, 
they tend to blow up at the tails. 

• It is difficult to approximate g^^ (s) using the Taylor expansion when 
Ur is small. In particular, a large order polynomial is required to get 
a good approximation of a narrow Gaussian, and this considerably 
increases the run time of the algorithm. 

The first of these problems was addressed in I1T9| . In this paper, the au- 
thors observed that it is important that the kernel used to approximate g^^ (s) 
be positive, monotonic, and symmetric. While it is easy to ensure symmetry, 
the other two properties are hard to enforce using Taylor approximations. 
It was noticed that, in the absence of these properties, the bilateral filter in 
|IT4ll created strange artifacts in the processed image (cf. Figure 3 in [19J). 
The authors proposed to fix this problem using the family of raised cosines, 
namely, functions of the form 

ct>{s) = [cos [—)\ {-T<s<T). (2) 

Here N is the order of the kernel, which controls the width oi (p{s). The 
kernel can be made narrow^ by increasing N. 

The key parameter in ^ is the quantity T. The idea here is that [cos(s)]^ 
is guaranteed to be positive and monotonic provided that s is restricted to 
the interval [— 7r/2, 7r/2]. Note that the argument s in (|2]| takes on the values 
\f{x — y) — f{x)\ as X and y varies over the image. Therefore, by letting 

T = max max \f{x — y) — f{x)\, 

one could guarantee (j){s) to be positive and monotonic over [—T,T]. In 
|[T9J, T was simply set to the maximum dynamic range, for example, 255 for 



grayscale images. We refer the readers to Figure 2 in ||T9| for a comparison 
of (|2]| and the polynomial kernels used in [14] . It was later observed in p4| 
that polynomials could also be used for the same purpose. The pol5momials 
suggested were of the form 

4>{s) = (l - ^J i-T < .^ < T). (3) 

1.2 Fast 0(1) implementation using shif table kernels 

For completeness, we now explain how the above kernels can be used 
to compute ([T]) using 0(1) operations. As observed in [24J, ^ and ^ are 
essentially the simplest kernels that have the so-called property of shiftability. 
This means that, for a given N, we can find a fixed set of basis functions 

(/)i(s), . . . , (pNis) and coefficients ci, . . . , cn, so that for any translation r, we 
can w^rite 

</,(s - r) = ci(t)0i(s) + • • • + cn{t)(I)n{s). (4) 

The coefficients depend continuously on r, but the basis functions have 
no dependence on r. For (|2]|, both the basis functions and coefficients are 
cosines, while they are polynomials for (|3]|. This shiftability property is at 
the heart of the 0(1) algorithm. Let f{x) denote the output of the Gaussian 
filter Qa^ (x) with neighborhood ft, 



f{x) = / gaAx-y)f{y)dy. 
Jn 



(5) 



Note that, by replacing gcr^ (s) with (/>(s), we can write ([1} as 



fix) = - Iciifix)) Fi{x) + --- CNifix)) Fn{x)] , (6) 

where we have set Fi{x) = f{x)(f)i{f{x)). Similarly, by setting Gi{x) = 
4>i{f{x)), we can write 

7? = ci{f{x)) Gl{x) + ■•■ + CNifix)) G^ix). (7) 

Now, it is well-known that certain approximation of (|5]l can be computed 
using just 0(1) operations per pixel. These recursive algorithms are based 
on specialized kernels, such as the box and the hat function ||2T1|23I , and the 
more general class of box splines [16J. Putting all these together, we arrive 
at the follow^ing 0(1) algorithm for approximating ([l}: 



1. Fix N, and approximate gar{s) using ^ or ^. 

2. For i = 1,2,. . . ,N, set up the images Fi{x) = f{x)(j)i{f{x)) and 
Gi{x) = cj)i{f{x)), and the coefficients Ci{f{x)) . 

3. Use a recursive 0(1) algorithm to compute each Fi{x) and Gi{x). 

4. Plug these into Q and to get the filtered image. 

It is clear that better approximations are obtained when A^ is large. On 
the other hand, the run time scales linearly with N. One key advantage of 
the above algorithm, however, is that the Fi{x) and Gi{x) can be computed 
in parallel. For small orders {N < 10), the serial implementation is found to 
be comparable, and often better, than the state-of-the-art algorithms. The 
parallel implementation, however, turns out to be much faster than the 
competing algorithms, at least for N < 50. Henceforth, we will refer to the 
above algorithm as SHIFTABLE-BF, the shiftable bilateral filter. 

1.3 Gaussian approximation for small cr^ 

This brings us to the question as to whether we can always work with, say, 
A^ < 50 basis functions, in SHIFTABLE-BF? To answer this question, we 
must explain in some detail step (1) of the algorithm, where we approximate 
the Gaussian range kernel, 

on the interval [— T, T]. This could be done either using (|2]|, 

N 



9ar{s) = lira 

N — >oo 



COS 



or, using (|3|, 



Na 



2 \N 



(8) 



These approximations were proposed in |T9l l24l. Note that, we have to 
rescale (|2| by ^/N, and (|3| by N, to get to the right limit. On the other 
hand, to ensure positivity and monotonicity, we need to guarantee that 
the arguments of ^ and ^ are in the intervals [— vr/2, 7r/2] and [0, 1]. A 
simple calculation shows that this is the case provided that N is larger than 



No = 4TVvrV2 = 0. 405 (T/fJr) 2 for the former, and Nq = 0.5{T/arf for the 
latter. In other words, it is not sufficient to set N large - it must be at least be 
as large as Nq. In Table [ij we give the values of Nq for different values of cxr 
when T = 255. It is seen that A'^o gets impracticable large for ar < 30. This 
does not come as a surprise since it is well-known that one requires a large 
number of trigonometric functions (or polynomials) to closely approximate 
a narrow Gaussian on a large interval. As pointed out earlier, this was also 
one of the problems in 1.14,1 . 

Table 1: The threshold A^o for different ar (T = 255). 



ar 


5 


10 


20 


30 


40 


60 


80 


100 


No 


1053 


263 


66 


29 


16 


7 


4 


3 



1.4 Present Contributions 

In this paper, we address the above problem, namely that Nq grows as 
0{T'^ /a1) with Gr- In Sectional we propose a fast algorithm for determining 
T exactly. Besides cutting down No, this is essential for determining the 
(local) dynamic range of a grayscale image that has been deformed, e.g., 
by additive noise. Setting T = 255 in this case can lead to artifacts in the 
processed image. Next, in Section |3l we provide a simple and practical 
means of reducing the order, which leads to quite dramatic reductions in the 
run time of SHIFTABLE-BF. These modifications are also applicable to the 
shiftable algorithms proposed in 11191124) . Finally, in Section [4l we provide 
some experimental results to demonstrate the acceleration that is achieved 
using these modifications. We also compare our algorithm with the Porikli's 
algorithms il4J, both in terms of speed and accuracy. 

2 Fast algorithm for finding T 

For the rest of the discussion, we work with finite-sized images (bounded ^) 
on the Cartesian grid. We continue to use x and y to denote points on the 
grid. The integral in ^ is simply replaced by a finite sum over Q. We will 
use the norm ||a;|| = |xi| + |x2|, wherea; = (xi,X2). Without loss of generality, 
we assume that J] is a square neighborhood, that is, 17 = {aj : ||a;|| < i?} 
where, say, R = Sus (if il is not rectangular, we take the smallest rectangle 
containing il). 



for 



Note that, for a given ar, we can cut down Nq by using a tight estimate 



max max I f(x 

33 \\y\\<B. 



y) 



/(^)l- 



(10) 



The smaller the estimate, the lower is the threshold Nq. The point is that the 
w^orst-case estimate T = 255 is often rather loose for grayscale images. For 
example, we give the exact values of T for a test image in Table |2| computed 
at different values of as- We also give the time required to compute T. 



Table 2: Exact values of T for the standard 512 x 512 Lena. Also shown is 
the time needed to compute T using Matlab. 



CTs 


1 


3 


5 


10 


15 


20 


30 


T 


153 


205 


208 


210 


211 


215 


215 


time (sec) 


2.06 


2.13 


2.33 


2.71 


3.26 


4.01 


5.60 



It is seen that the exact values of T are indeed much less than the worst- 
case estimate, particularly for small ag. For a^ = 3, T is only about 205. Even 
for as as large as 30, T is about 215. Consider the bilateral filter with as = 10 
and ar = 10. From Table [l| A'^o = 263 using T = 255. However, using the 
exact value T = 210, we can bring this down to 263 • (210/255)^ ^ 178, a 
reduction by almost 100. For smaller values of ar, this gain is even more 
drastic. However, notice the time required to compute T in Table [l] This 
increases quickly with the increase in as (in fact, scales as 0{B?)). Experi- 
ments show us that, for large as, this is comparable to the time required to 
compute the bilateral filter. It would thus help to have an 0(1) algorithm for 
computing T. Motivated by our previous work on filtering using running 
sums HH, we recently devised an algorithm that does exactly this. We later 
found that the algorithm had already been discovered two decades back in 
a different context IITtIITsI. 

Our algorithm is based on the following observations. First, note that 
we can take out the modulus from pO| | using symmetry. 



Proposition 2.1 (Simplification). 



max 



f{x) - max f{x - y) 

\\v\\<R 



(11) 



Proof. This follows from the observations that |t| = max(i, —t), and that 
||a^ — yll < ^ is symmetric in x and y. Moreover, note that the operation 
that takes two numbers a and h and returns max(a, b) is associative. Using 



associativity, we can write ( [T0| as T = max(T+, T_), where 

(12) 



and 



max 



max 

X 



f{x) - max f{x - y) 

\\y\\<R 



max f{x -y)- f{x) 

\\y\\<R 



(13) 



We claim that T+ = r_, so that we need not compute them separately. 
Indeed, suppose that the first maximum is attained at xq and y^, that is, 
r+ = /(ajo) - f{xQ - y^). Taking a; = a^o - Z/o and y = -y^, and noting that 
\\x — y\\ < -R, we must have 

T- > fix -y)- fix) = fixo) - fixo - yo) = T+. 

By an identical argument, r+ > T_, and the proposition follows. D 

The problem is now reduced to that of computing the windowed maxi- 
mums in pi) . A direct computation would still require 0(i?^) comparisons. 
It turns out that we can do this very fast (no matter how large is R) by 
exploiting the overlap between adjacent windows. This is done by adapting 
the so-called MAX-FILTER algorithm. 

Proposition 2.2 (Max-Filter Algorithm). There is an Oil) algorithm for com- 
puting 

max fix-y) 
\\y\\<R 

at every x. 

This algorithm was first proposed by van Herk, and Gil and Werman 
||T7l[l8|. It is clear that since the search domain Q is separable, it suffices 
to solve the problem in one dimension. The problem in two-dimensions 
can be solved simply by iterating the one-dimensional MAX-FILTER along 
each dimension. From ( [TTj l, we arrive at Algorithm IT] for computing T 
with 0(1) operations. We note that Algorithm [l] does not actually compute 
max {|/(a; — y) — fix)\ : \\y\\ < R} at every x. It only has access to the 
distribution of the maximums. For completeness, we have explained the 
MAX-FILTER algorithm in the Appendix. For further details, we refer the 
readers to CZlllll. 
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Figure 1: Approximation of the target Gaussian gar{s) on the interval 
[—255, 255] using the raised cosine 0(s) in (|8]l. We have used dr = 10 in 
this case, for which A'^o = 263. Top: The solid blue line shows the target 
Gaussian g^,. (s), and the broken red line is the raised cosine </)(s). We trun- 
cate 4>{s) to obtain (j)£{s) in p6| , where we used e = 0.005. In this case, 
M = 111. Thus, a total of 2 x M = 222 terms are dropped from the se- 
ries. The truncation (/>e(s) has only 42 terms. Bottom: The solid blue line 
is ga,. (s), and the broken red line is 0e(s). Note that the quality of approx- 
imation is reasonably good even after discarding almost 84% terms. As a 
result of the truncation, small oscillations of size e emerge on the tails. The 
approximation around the origin is positive and monotonic. 



Algorithm 1 Fast algorithm for computing T 
Input: Image f{x) and R. 



Return: Tasini 10 1. 



1. Setfias window of MAX-FILTER. 

2. Apply MAX-FILTER along each row of f{x); return image m{x). 

3. Apply MAX-FILTER along each column of m{x); return image M{x) 

4. Set T as the maximum of f{x) — M{x) over all x. 



3 Acceleration using truncations 

We have seen that, by using the exact value of T, we can bring dow^n the 
run time by 10 — 20%. Unfortunately, Tablejlltells us that this alone is not 
sufficient in the regime a-r < 15. For example, A'^o is of the order 10^ in 
the regime a-r < 5. So why do we require so many terms in (JSJ) and ^ to 
approximate a narrow Gaussian? This is exactly because we are forcing 0(s) 
to positive and monotonic on its broad tails, where gaA^) is close to zero. 
For example, consider the approximation in (|8]|: 



N 



E2- 

n=0 



■N 



N 



n 



cos 



{2n-N)s 



Nar 



(14) 



By requiring N > Nq, we can guarantee that (1) (/)(s) is close to g^j^ (s), and 
(2) 0(s) is positive and monotonic over [— T, T]. Note, however, that gar{s) 
falls off very fast, and almost vanishes outside ±3(7,.. It turns out that only 
a few significant terms in p4| | contribute to the approximation in the ztScir 
region. The rest of the terms have a negligible contribution, and are required 
only to force positivity at the tails. 



Table 3: The threshold A^o before and after truncation, tolerance e 
(worst-case setting T = 255). 



ar 


3 


5 


8 


10 


12 


15 


No (before) 


2929 


1053 


413 


263 


184 


119 


No (after) 


95 


77 


53 


43 


36 


29 


% of terms dropped 


96 


92 


88 


85 


82 


77 



0.05 



Thanks to expression ( [T4| |, it is now straightforward to determine which 
are the significant terms. Note that the coefficients 2^^(^) in p4| are 
positive and sum up to one. In fact, they are unimodal and closely follow 
the shape of the target Gaussian. The smallest coefficients are at the tails, and 



10 



the largest coefficients are at the center. In particular, the smallest coefficient 
is 1/2^, while the largest one is 2~^[{N/2)\]-'^N\ (assuming N to be even). 
For large A^, the latter is approximately ^JljixN using Stirling's formula. 
Thus, as N gets large, the coefficients get smaller. What is perhaps significant 
is that the ratio of the smallest to the largest coefficient is (7rA^2^^^^)^^' ^, 
and this keeps shrinking at an exponential rate with N . On the other hand, 
the cosine functions (which act as the interpolating function) are al^vays 
bounded between [—1, 1]. 

The above observation suggests dropping the small terms on the tail. In 
particular, for a given tolerance e > 0, let M = M(e) be the smallest term 
for which 



n=0 ^ 



and set 



n=M \ V I / 

It follows that the error |(/)(s) — (/)e(s)| is within e for all — T < s < T. Note 
that 0e(s) is symmetric, but is no longer guaranteed to be positive on the 
tails, where oscillations begin to set in. However, what we can guarantee is 
that the negative overshoots are within — e. In fact, the quality of the final 
approximation turns out to be quite satisfactory. This is illustrated w^ith an 
example in Figure IT] The main point is that, in the regime Or < 15, we are 
now^ able to bring down the order to well within 100. We list some of them 
in Table Is] Notice that we can drop more terms for a given accuracy as the 
kernel gets narrow. For ar < 5, we can drop almost 95% of the terms, while 
keeping the error within 0.5% of the peak value. 

Note that p5| actually requires us to compute a large number of tails 
coefficients, which are eventually not used in p6| . It is thus better to estimate 
M when A^ is large. A good estimate of p5| l is provided by the Chernoff 
bound for the binomial distribution ||22| , namely. 



n=0 ^ ^ 



It can be verified the estimate is quite tight for N > 100. By setting the 
bound to e/2, we get 

M = ^(^N- V4iVlog(2/e)) . (17) 

11 



Algorithm 2 Improved SHIFTABLE-BF 

Input: Image f{x), variances a"^ and a"^, and tolerance e. 
Return: Bilateral filtered image f{x). 

1. Set R to some factor of as (determined by size of spatial Gaussian). 

2. Input fix) and R to Algorithm[l| and get T. 

3. SetiV = 0.405(r/(T^)2. 

4. Assign M using the following rule: 

(a) [ar > 40] Set M = 0. 

(b)[10 <ar< 40] Compute M from ^, given N and e. 



(c) [cr-r < 10] Plug iV and e into (|T7|) to get M. 



5. Use A^ and M to specify ^^(a;) in p6| . 

6. Using (/)£(a;) as the range kernel, input f{x) to SHIFTABLE-BF to get 

/»■ 



The final algorithm obtained by combining the proposed modifications 
is given in Algorithm [2] Henceforth, we will continue to refer to this as 
the SHIFTABLE-BF. The Matlab implementation of SHIFTABLE-BF can be 
found here |j201. 

4 Experiments 

We now provide some results on synthetic and natural images to understand 
the improvements obtained used our proposal. While all the experiments 
were done on Matlab, we took the opportunity to report the run time of a 
multithreaded Java implementation of Algorithmic] All experiments were 
run on an Intel quad core 2.83 GHz processor. 

4.1 Run time 

First, we tested the speedup obtained using Algorithm IT] for computing T. 
We used a Matlab implementation of this algorithm [20|. It is expected that 
the run time remain roughly the same for different o-^. As seen in table [4] 
this is indeed the case. The run time of the direct method, for the same 
image and the same settings of ag, "was already provided in table |2] Note 
that we have been able to cut dow^n the time by a few orders using our fast 
algorithm. 

We then compared the run times of multithreaded Java implementations 
of SHIFTABLE-BF proposed in | [T9| and its present refinement. For this, we 
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Table 4: Average time required to compute the exact value of T using 
Algorithm [T| We used the standard 512 x 512 Lena. See also Table |2] 



CTs 


1 


3 


5 


10 15 20 30 


Time (millisec) 


70 


71 


73 


71 70 72 71 



Table 5: Comparisor\ of the run times of the multithreaded Java implemer\- 
tations of SHIFTABLE-BF (as in [19|) and its proposed improvement, for 
different values of cjr (fixed as = 15). We used the test image Checker shown 
in Figure [2J Shown in the table are the different e used in Algorithmic] We 
use cx) to signify that the run time is impracticably large. 



Ur 


5 


8 


10 


12 


15 


20 


SHIFTABLE-BF (millisec) 


oo 


oo 


oo 


2130 


1280 


800 


Improved SHIFTABLE-BF (millisec) 


880 


500 


350 


270 


250 


150 


e (% of peak value) 


3 


2 


2 


1 


1 


1 



used the test image Checker shown in Fig. |2] We have used small values 
of Ur, and a fixed Us = 15. The average run times are shown in Table [5] 
The tolerance e used for the truncation are also given. We use a smaller e 
(larger truncation) as o-^ gets small. Notice how we have been able to cut 
down the run time by more than 70%. This is not surprising, since we have 
discarded more than 85% of terms. Notice that the run times are now well 
within 1 second. The run time of the direct implementation (which does 
not depend on o"r) ^vas around 10 seconds. The main point is that we can 
now implement the filter in a reasonable amount of time for small Gr, which 
could not be done previously in 111 91 . 



4.2 Accuracy 

We next studied the effect of truncation. To get an idea of the noise that 
is injected into the filter due to the truncation, we used the Checker image. 
This particular image allowed us to test both the diffusive and the edge- 
preserving properties of the filter at the same time. We used the setting 
Gs = 30 and Gr = 10. First, we tried the direct implementation of Q, using a 
very fine discretization. Then we tried Algorithm [2J The difference between 
the two outputs is shown in Figure [3] The artifacts shown in the image are 
actually quite insignificant, within 10^^ times the peak value. Notice that 
most of the artifacts are around the edges. This comes from the oscillations 
induced at the tails of kernel by the truncation. To compare the filter outputs 
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Figure 2: Test image Checker of size 256 x 256 consisting of black (intensity 0) 
and white (intensity 255) squares. The bilateral filter acts as a diffusion filter 
in the interior of the squares, and as an edge-preserving filter close to the 
boundary. It preserves both the constant-intensity regions, and the jumps 
across squares. 

(with and without truncation), we extracted two horizontal scan profiles 
from the respective outputs. These are shown in Figured Notice that it is 
rather hard to distinguish the two. 

We then applied the filters on the standard grayscale image of Lena. We 
first applied the direct implementation followed by Algorithm |2] In this 
case, T was computed to be 215. The difference image is shown in FigurelSJ 
It is again seen that the small artifacts are cluttered near the edges. We have 
also tried measuring the mean-squared-error (MSE) for different cjr. The 
results are given in Table [6| Note that relatively larger MSEs are obtained 
at small ar- This is because we are forced to use a large truncation to speed 
up the filter at small Ur- The above results show that we can drastically 
cut down the run time of filter using the proposed modifications, without 
incurring significant errors. 

Table 6: The mean-squared-error (MSE) between the filter outputs before 
and after truncation. We use the Lena image, and a fixed cr^ = 30. The 
toleran ce e is chosen as in Table Isl 



10 12 15 20 



lOlogio(MSE) -9.3 -11.1 -11.8 -13.5 -13.8 -14.1 
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Figure 3: The difference between the outputs of the direct (high resolution) 
implementation of ([l}, and that obtained using Algorithm |2] The test image 
in Figure |2] was used as the input, and settings were ag = 30 and ar = 10. 
The noise created due to the truncation is actually very small - the error is 
within 10^^ times the peak value. See the comparison of scan profiles in 
Figure HI 
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(a) High resolution implementation. 
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Figure 4: Comparison of the respective scan profiles from the bilateral filter 
outputs (cf. description in Figure |3). 




Figure 5: The difference between the outputs for the image Lena (size 512 x 
512). The settings were ag = 30 and a-r = 10. The noise created due to 
the truncation is within 10"'^ times the peak value, and is thus practically 
insignificant. Notice that the errors are mainly around the edges. 
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4.3 Comparison with a benchmark algorithm 

We next compared the performance of the improved SHIFTABLE-BF algo- 
rithm with those proposed in IIT4l . The latter algorithms are considered as 
benchmark in the literature on fast bilateral filtering. Porikli proposed a 
couple of algorithms in [14J - one using a variable spatial filter and a poly- 
nomial range filter (we call this BFl), and the other using a constant spatial 
filter and a variable range filter (we call this BF2). The difficulty with BFl is 
that it is rather difficult to control the width of the polynomial range filter. 
In particular, as was already mentioned in the introduction, it is difficult 
to approximate narrow Gaussian range kernels using BFl. We refer the 
interested readers to the experimental results in llT9l , where a comparison 
was already made between BFl and SHIFTABLE-BF. For completeness, 
we perform a single experiment to compare these filters when (t,. is small 
(a rather large value of ar "was used in the experiments in [19J). For this, 
and the remaining experiments, we will consider the standard test image 
of Barbara of size 512 x 512. This image has several texture patterns, and is 
well-suited for comparing the performance of bilateral filters with narrow 
range kerneldj In particular, we consider the Gaussian bilateral filter with 
as = 20 and ar = 20. The results obtained using SHIFTABLE-BF and BFl 
are shown in Figure [6] The error between the direct implementation of the 
bilateral filter and SHIFTABLE-BF was within 10"^. On the other hand, 
note how BFl completely breaksdown. The reason for this was already 
mentioned in the introduction. A similar breakdown, with a larger ar, 'was 
also observed in Figure 3 in |[T9| . 

We next considered BF2, which does not suffer from the above problem. 
However, we note that BF2 cannot be used to perform Gaussian bilater 
filtering - it only w^orks with constant spatial filters (box filters). This is 
because BF2 uses fast integral histograms, and this only works for box filters. 
To make the comparison even, we considered bilateral filters with constant 
spatial filters and Gaussian range kernels. We note SHIFTABLE-BF can be 
trivially modified to work with arbitrary spatial filters. 

First, we compared the run times of the Matlab implementations of 
SHIFTABLE-BF and BF2. The results obtained at particular settings of 
as (radius of box filter) and ar are shown in Figure [tI We see that the 
run time of SHIFTABLE-BF is consistently better than that of BF2. The 
difference is particularly large when ar > 10, and it closes down as ar gets 
small. All these can be perfectly explained. Note that, as per the design, 
the computational complexity of BF2 is 0(1) both with respect to ag and ar- 

^we thank one of the reviewers for suggesting this example. 
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(a) Output of SHIFTABLE-BF. (b) Output of Porikli's algorithm. 

Figure 6: Comparison of the bilateral filtering results obtained using 
SHIFTABLE-BF and Porikli's polynomial-kernel algorithm [14J. In both 
cases, we used a Gaussian spatial filter {as = 20) and a Gaussian range filter 
((Tr- = 20). For our algorithm, we set e = 0.03. For Porikli's algorithm, we 
used a Taylor polynomial with order comparable to that of the raised cosine 
kernel. 



This is indeed seen to be the case from the run times. On the other hand, 
the computational complexity of SHIFTABLE-BF is 0(1) with respect to as 
(this is again clear from the plots in Figure [tI). However, for a given as, the 
complexity of the the original algorithm scales as 0{l/a'^). The complexity, 
in fact, remains roughly the same even after the proposed truncation. This 
explains the step rise in the run time for small values of ar, as shown in 
Figure [7| However, the actual run time goes down substantially as a result 
of the truncations (cf. Table |5]). In particular, we have noticed that the worst 
case run time of SHIFTABLE-BF is less than the average run time of BF2 
for ar as low as 3. 

We note that the run time of BF2 depends on the number of bins used 
for the integral histogram. In the above experiments, we used as many bins 
as the grayscale levels of the image. It is thus possible to reduce the run time 
by cutting down the resolution of the histogram. However, this comes at the 
cost of the quality of the filtered image. This lead us to compare the outputs 
of SHIFTABLE-BF and BF2. In Figure|8} we compared the MSEs of the two 
algorithms for different as and a,- for the image Barbara. We note that the 
MSE for SHIFTABLE-BF is significantly lower than BF2. The gap is around 
30 dB for a-r > 10, and around 90 dB w^hen ar < 10. We noticed that this 
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Figure 7: Comparison of the run times for different as and Ur- We compare 
our algorithm SHIFTABLE-BF with Porikli's algorithm BF2 (using integral 
histograms IHI). In either case, we use a constant spatial filter and a Gaus- 
sian range filter. For our algorithm, we set e = 0.01. For Porikli's algorithm, 
we use the full resolution (256 bin) histogram. Both the algorithms were 
implemented in Matlab. 
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Figure 8: Comparison of the MSE for different as and Ur- The MSEs are 
computed between the direct implementation and SHIFTABLE-BF, and 
between the direct implementation and BF2. The parameter settings are 
identical to those used in Figure [7[ 
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Figure 9: Dependence of order on 0-^. Here N and M are as defined in 
Algorithm [2] The sudden jump at o-,. = 10 is due to the truncation rule in 
dTTl). 



difference becomes even more pronounced if we use a smaller number of 
bins. As expected, note that the individual MSEs do not vary much with (jg. 
The reader will notice that the MSB of SHIFTABLE-BF suddenly drops by 
60 dB when ar < 10. To explain this, we plot the effective order N — 2M for 
different ar in Figure [9] We see that the order (of approximation) suddenly 
jumps up when ar goes below 10, which explains the jump in the MSE in 
Figure Is] This is simply due to the rule p7| | used in Algorithmic] 

In Figure 10 the filtered outputs of the two algorithms are compared 
with the direct implementation, for a small value of ar- Note that the 
pointwise error between the direct implementation and SHIFTABLE-BF is 
of the order 10^'^. On the other hand, the corresponding error between the 
direct implementation and BF2 is substantial, a few orders larger than that 
for SHIFTABLE-BF. This explains the large gap between the MSEs in Figure 

El 

We close this section by commenting on the memory usage of SHIFTABLE-BF 
and BF2. The former requires us to compute and store a total of 2{2N — M) 
images, while the latter requires us to store a histogram with B bins per 
pixel (equivalent of B images). It is clear from Figure [9] that even for a half- 
resolution histogram {B = 128) and for cr^ > 10, the memory requirement 
of BF2 is comparable to that of SHIFTABLE-BF. For smaller values of ar, 
SHIFTABLE-BF clearly requires more memory than BF2. 
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(a) Test image Barbara (size 512 x 512). (b) Bilateral filter output (direct implementa- 
tion). 




(c) Error between (b) and our method. (d) Error between (b) and Porikli's method. 

Figure 10: Comparison of the results obtained using SHIFTABLE-BF with 
the direct implementation (high resolution) and Porikli's algorithm BF2. In 
all three cases, we used a constant spatial filter (radius = 20) and a Gaussian 
range filter {ar = 5). For SHIFTABLE-BF, we set e = 0.01. For BF2, we 
used the best possible resolution (256 bin histogram). The run times of the 
Matlab implementations were: Direct implementation (37 seconds), BE 2 (13 
seconds), and SHIFTABLE-BF (6 seconds). 
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5 Discussion 

In this paper, we proposed some simple ways of accelerating the bilateral 
filtering algorithm proposed in [19J. This, in particular, opened up the 
possibility of implementing the algorithm in real-time for small Ur- We 
note that the problem of determining the optimal as and ar for a given 
application is extrinsic to our algorithm. We are only required to determine 
the parameter T, which is intrinsic to our algorithm. A fast algorithm was 
proposed in the paper for this purpose. However, we note that having a 
fast algorithm does make it easier to determine the optimal parameters. 
In this regard, we note that Kishan et al. have recently shown how our 
fast algorithm can be used to tune the parameters for image denoising, 
under different noise models Il25ll28l . One crucial observation used in these 
papers is that the a certain unbiased estimator of the MSE can be efficiently 
computed for our fast bilater filter, using the linear expansions in Q and 
(|7]|. The "best" parameters are choosenby optimizing this MSE estimator. 
While this can also be done for the polynomial-based bilateral filter in [14], 
this trick cannot be used for other fast implementations of the bilateral filter, 
at least to the best of our knowledge. 

Finally, we note that the ideas proposed here can also be extended to the 
0(1) algorithm for non-local means given in [24J. In non-local means 12]|, 
the range kernel operates on patches centered around the pixel of interest. 
A coarse non-local means was considered in [24J, where a small patch 
neighborhood consisting of the pixels ui, . . . ,Up (where, say, ui = 0) was 
used. In this case, the main observation was that formula for the non-local 
means can be written in terms of foUow^ing sums: 

/ f{x-y)g{f{x + ui)-fix-y + ui),..., (18) 

J\\v\\<R 



'\\y\\<R 

f{x + Up) - f{x-y + Up)) dy, 



and 

I gif{x + ui)-f{x-y + ui),..., (19) 

A\v\\<R 

f{x + Up) - f{x-y + Up)) dy, 

where g{si, . . . ,Sp) is an anisotropic Gaussian in p variables, and has a 
diagonal covariance. This looks very similar to ([ij, except that we now 
have a multivariate range kernel. By using the separability oi g{si,. . . ,Sp), 
and by approximating each Gaussian component by either (JSJ) or (|9]), a 0(1) 
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algorithm for computing ([Tsjl and ([T9| was developed. We refer the readers 
to ||24| for further details. The key consideration with this algorithm is 
that the overall order scales as A^^, where N is the order of the Gaussian 
approximation for each component. In this case, it is thus important to 
keep N as low as possible for a given covariance. After examining p8| and 



( 19 1, it is clear that the interval over which the one dimensional Gaussians 
need to be approximated is [— T, T], where T is as defined in pO| |. We can 
compute this using Algorithm IT] Moreover, we can further reduce the order 
by truncation, especially when the covariance is small. However, N^ can 
still be large (even for p = 3 or 4), and hence a parallel implementation must 
be used for real-time implementation. 



6 Appendix : MAX-FILTER algorithm 

We explain how the MAX-FILTER algorithm works in one dimension. Let 
fi, f2, ■ ■ ■ , In he given, and we have to compute max(/j_/j, . . . , fi+n) at 
every interior point i. Assume R is an integer, and A^ is a multiple of the 
window size W = 2R + 1, say, N = pW (padding is used if this not the 
case). The idea is to compute the local maximums using running maximums, 
similar to running sums used for local averaging [21 J. The difference here 
is that, unlike averaging, the max operation is not linear. This can be fixed 
using "local" running maximums. 

We begin by dividing fi, f2, ■ ■ ■ , In into p equal partitions. The A:th 
partition {k = 0,1, . . . ,p—l)is composed of fi+kw-, • • • > fw+kw- For a given 
partition k, we recursively compute the two running maximums (of length 
W), one from from the left and one from the right. Let l^'^^ and r^'^) be the 
left and right running maximums for the fcth partition. The sequence l^^^ 

(k) 

start at the left of the partition with /} = fi+kw, arid is recursively given 
by ll = max( ll_\, fi+kw ) for i = 2,3, . . . ,W. It ends on the right end of 
the partition. On the other hand, r^^) start at the right and ends on the left: 
rw = fw+kW, and r}^^_ ■ = max( r)^_ij^i, fw-i+kw ) for i = 1, 2, . . . , T^ - 1. 
This is done for every partition to get /(°\ . . . , /(p~^) and r'^^\ . . . , r^^"^). 

We now concatenate the left maximums into a single function li,. . . ,l]\[, 
that is, we set / = (/^"^^ . . . , 1^p~^^). Similarly, we concatenate the right 
maximums in order, r = {A^', . . . , Ap^^'). In practice, we just need to re- 
cursively compute /i, . . . , /jv and ri, . . . , r^, resetting the recursion at the 
boundary of every partition. We now split fi-R, . . ■ , fi+R into two seg- 
ments, which either belong to the same partition or two adjacent parti- 
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tions. Then, from the associativity of the max operation, it is seen that 

max(/i_i?, . . . , fi+R) = max(ri_i?, /j+i?). Note that, we need just 3 max oper- 
ations per point to get the result, independent of the window size R. 
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